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Abstract — As energy proportional computing gradually ex- 
tends tlie success of DVFS (Dynamic voltage and frequency 
scaling) to the entire system, DVFS control algorithms will play 
a key role in reducing server clusters' power consumption. The 
focus of this paper is to provide accurate cluster-level DVFS 
control for power saving in a server cluster. To achieve this goal, 
we propose a request tracing approach that online classifies the 
major causal path patterns of a multi-tier service' and monitors 
their performance data of each tier as a guide for accurate DVFS 
control. The request tracing approach significantly decreases 
the time cost of performance profiling experiments that aim 
to establish the empirical performance model. Moreover, it 
decreases the controller complexity so that we can introduce 
a much simpler feedback controller, which only relies on the 
single-node DVFS modulation at a time as opposed to varying 
multiple CPU frequencies simultaneously. Based on the request 
tracing approach, we present a hybrid DVFS control system that 
combines an empirical performance model for fast modulation 
at different load levels and a simpler feedback controller for 
adaption. We implement a prototype of the proposed system, 
called PowerTracer, and conduct extensive experiments on a 3- 
tier platform. Our experimental results show that PowerTracer 
outperforms its peer in terms of power saving. 

I. Introduction 

In data centers, most of services adopt the multi-tier archi- 
tecture, and services are replicated or distributed on a cluster 
of servers [15]. Server cluster operating costs arising from 
increasing energy consumption of non energy-proportional 
server hardware [17] [41] have prompted the investigation 
of cluster-level power management. Current solutions that 
save cluster power consumption can be classified into three 
main categories: (1) DVFS (Dynamic voltage and frequency 
scaling), (2) dynamic cluster reconfiguration by consolidating 
services through request distribution [13] [31], and (3) server 
consolidation through moving services to virtual machines 
[39] [38] [28] [43]. 

On one hand, reconfiguration approaches can greatly re- 
duce cluster power consumption by consolidating services 
through request distribution [13] or moving services to virtual 
machines[39] on a subset of physical nodes [39] and turn 
off the remaining ones. However, as claimed by D. Meisner 
et al [39], reconfiguration alone, cannot close the gap of 
resource demands between peak and average workloads, and 
server clusters still require sufficient capacity for peak resource 

'Triggered by an individual request, a causal path is a sequence of com- 
ponent activities with causal relations. Major causal path patterns represent 
repeatedly executed causal paths that account for significant fractions. 



demand. On the other hand, as energy proportional computing 
[17] gradually extends the success of DVFS to the entire 
system [39], accurate DVFS control algorithms are essential to 
reducing cluster power consumption. Therefore, the motivation 
of this work is how to leverage commercial DVFS technolo- 
gies, which exemplifies the energy-proportional concept [39], 
on physical nodes to save cluster power consumption. 

In this paper, we propose a novel request tracing approach 
[18] [8] for accurate cluster-level DVFS control. Through in- 
strumentation, which can be performed on operating system 
kernel, middleware or application levels, our request tracing 
tool can obtain major causal path patterns (in short, patterns) 
in serving different requests and capture server-side latency, 
especially service time of each tier in different patterns. 
Thus, we can fully understand the role of each component 
played in serving different requests in terms of service time 
percentage. The advantage of the request tracing approach 
in DVFS control lies in two-fold: first, it decreases the time 
cost of performance profiling experiments that aim to obtain 
the performance model; second, it decreases the controller 
complexity. Exploiting major patterns, we propose a simpler 
feedback controller, which only relies on the single-node 
DVFS modulation at a time as opposed to varying multiple 
CPU frequencies in a simultaneous manner. On a basis of this, 
we present a hybrid DVFS control algorithm that combines an 
empirical performance model for fast modulation at different 
load levels and a simpler controller for adaption. 

We implement a prototype of the proposed system, called 
PowerTracer, and conduct extensive experiments on a 3-tier 
platform. We use two kinds of workload of a RUBiS (Rice 
University Bidding System) web application [10] to validate 
the efficacy of PowerTracer in terms of three metrics: total 
system power savings compared to the baseline, request dead- 
line miss ratio, and average server-side latency. In particular, 
we investigate the effects of the server-side latency threshold 
and the number of main patterns upon server cluster power 
consumption and other performance metrics. Our experimental 
results show that PowerTracer outperforms its peer [13] in 
terms of power saving. 

The remainder of this paper is organized as follows. Section 
II presents the background of this work. Section III details the 
system architecture, performance model, and DVS controller 
of PowerTracer. Section IV describes the system implementa- 
tion. Section V presents the experimental results. Section VI 
surveys related work. Finally, Section VII draws a conclusion. 



II. Background 

In this section, we highlight the background information 
of PowerTracer with respect to request tracing, performance 
profiling, and control theory. 

Request tracing. We employ the concept of request tracing 
from our previous work [8], [9]. In Fig. 1, we observe that 
a request causes a series of interaction activities in the OS 
kernel, e.g. sending or receiving messages. Those activities 
happen under specific contexts (processes or kernel threads) 
of different components. We record an activity of sending a 
message as S] , which indicates a process i sends a message 
to a process j. We record an activity of receiving a message 
as Rl , which indicates a process j receives a message from a 
process i. Our concerned activity types include: BEGIN, END, 
SEND, and RECEIVE. SEND and RECEIVE activities are 
those of sending and receiving messages. A BEGIN activity 
marks the start of servicing a new request, while an END 
activity marks the end of servicing a request. 
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Fig. 1. Activities with causal relations in the kernel. 

When an individual request is serviced, a series of activities 
have causal relations or happened-before relationships, and 
constitute a causal path. For example in Fig. 1, the activity se- 
quence {^c u'S'i 21^1 2''S'2.3J^2.3''5'|a,} constitutcs a causal 
path. For each individual request, there is a causal path. For 
a request, the server-side latency can be defined as the time 
difference between the time stamp of BEGIN activity and END 
activity in its corresponding causal path. The service time of 
each tier can be defined as the accumulated processing time 
between SEND activities and RECEIVE activities. For each 
tier, its role in serving a request can be measured in terms of 
service time percentage, which is the ratio of service time of 
the tier to the server-side latency. 

After reconstructing those activity logs into causal paths, we 
classify those causal paths into different patterns. Then, we 
present main patterns to represent repeatedly executed causal 
paths, which account for significant fractions. For each pattern, 
we can obtain the average server-side latency and average 
service time of each tier In addition, through observing the 
number of BEGIN activities, which marks the beginning of 
serving new requests, we can derive the current load level of 
the services. 

Performance profiling. On a basis of the approach used 
in [45], our approach aims to create detailed performance 
profiles of multi-tier services with different DVFS modulations 



under varying workloads. A tier profile captures each tier's 
performance characteristics as a function of DVFS modulation 
and varying loads. In order to profile a tier, we deploy a test 
environment and vary the DVFS setting of a node, on which 
each tier is deployed. We then apply a variety of loads and 
collect each tier's performance characteristics, independent of 
other tiers. After acquiring the measurements, we derive the 
appropriate functions mappings from DVFS modulation and 
load levels to each tier's performance metrics. 

Control theory. Similar to the control-theoretic terminol- 
ogy used in [40], we refer to the server cluster and deployed 
multi-tier service being controlled as the target system, which 
has a set of metrics of interest, referred to as measured output, 
and a set of control knobs, referred to as control input. The 
controller periodically (at each control period) adjusts the 
value of the control input such that the measured output 
(at each sampling period) can match its desired value — 
referred to as reference input specified by the system designer 
We refer to the difference between the measured output and 
the reference input as control error). The controller aims to 
maintain control error at zero, in spite of the disturbance in 
the system. 

III. Architecture and System Design 

In this section, we first describe the architecture of Pow- 
erTracer. Then, we present design details in online request 
tracing, performance profiling, and controller, respectively. 

A. PowerTracer Architecture 

As shown in Fig. 2, PowerTracer consists of three major 
modules: online request tracing, performance profiling, and 
controller The online request tracing module is used to collect 
online performance data of multi-tier services, including main 
patterns, their respective server-side latencies, service time of 
each tier, and current loads. 

The performance profiling module aims to off-line establish 
an empirical performance model under different load lev- 
els. The model can capture the relationship between service 
performance and CPU frequency setting of each node. With 
the online performance data produced by the request tracing 
module, we can decide the dominated tier by checking service 
time percentage of each tier in different patterns. Then, we 
mainly observe the effect of DVFS modulation of the node, 
on which the dominated tier are deployed. In this way, we can 
ease the establishment of the performance model. 

For different load levels, which can be provided by the 
request tracing module, the controller module quickly calcu- 
lates the optimal setting of DVFS modulation according to 
the performance model created by the performance profiling 
module. For adaptation, the controller module only relies on 
the single-node DVFS modulation at a time as opposed to 
varying multiple CPU frequencies simultaneously. 

B. Online Request Tracing 

Our previous work PreciseTracer [8] [9] has described how 
to design and implement an online request tracing system for 
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Fig. 2. PowerTracer system architecture 



multi-tier services of black boxes. Black boxes indicate that we 
do not need to have source code of application or middleware. 
In this paper, our contribution is how to exploit online request 
tracing for accurate DVFS control in multi-server clusters. 

On a basis of PreciseTracer, we develop a new component 
called Analyzer, which classifies a large variety of causal paths 
into different patterns, then extracts online performance data 
for main patterns according to their fractions. 

Our classification policy of causal paths addresses the 
following challenging problems: first, it is difficult to utilize 
individual causal path from massive request traces as the 
guide for DVFS modulation; second, causal paths are different 
and the overall statistics of performance information (e.g. the 
average server-side latency used in [13]) of all paths hides the 
diversity of patterns. There are several ways of classifying 
causal paths. For example, in our previous work [9], we 
classify causal paths into different patterns according to their 
shapes. In this paper, we found that classifying causal paths 
according to the size of the first message sent by a client 
performs well in producing patterns. Thus, we use a k-means 
cluster method [14] to classify causal paths into patterns based 
on the size of the first message sent by a client. 

We use a 5-tuple <pattem ID, the size of the first message 
send by a client, the average server-side latency of pattern 
ID, the average service time per tier, the current load> to 
represent the online performance data of a pattern in a multi- 
tier service. In our previous work [9], for RUBiS workload 
[10], we observe that there are more than hundred patterns, but 
the top ten patterns take up a large fraction of paths, more than 
88%. Therefore, we single out the top A^ patterns according to 
their fractions as the guide for building both the performance 
model and the controller module. 

C. Performance profiling 

The purpose of performance profiling is to create an empir- 
ical performance model, which is used for the fast modulation 
procedure. The power model can be represented by Equation 
(1) as follows. 
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N is the number of major patterns singled out by our 
request tracing system. In our experiments in Section V, we 
will observe the effect of different number of main patterns 
on the power saving. M is the tiers of a multi-tier service, 
namely three in this paper. DL^i{i = 1,2, ..., N) is the average 
server-side latency of pattern i when the current load is L. 
lL.i{i=^,---,N) is the network latency of pattern i when the 
current load is L. Function fL.i.j(Fj) represents the average 
service time of each tier j in pattern i when the clock 
frequencies run with Fj(j=l,...,M) and the current load is L. 

For each load level, the function set / is derived as follows: 
first, through our tracing system, we measure the service 
time of each tier while traversing all the four discrete CPU 
frequency values offered by each server; second, we derive 
functions fL.i.j between the average service time of each tier 
in major patterns and the CPU frequencies of nodes by the 
normal quadratic polynomial fitting tool. The models fit well, 
and the fitting coefficient R^>91%. 

Through request tracing and performance profiling, we can 
ease the establishment of the empirical performance model. 
For each workload, e.g. given the transition table and the 
number of current clients, we create a function set / offline. 
For a three-tier service deployed on the testbed that offers 
four CPU frequency levels, if we consider 10 different load 
levels (from zero to the upper bound of the load), we need 
to do 10 X 4 X 3 times of experiments to gain the function 
set /. Since our request tracing tool can obtain the service 
time of each tier in major causal path patterns and decide 
the dominated tiers in terms of the service time percentage of 
each tier, our system can effectively decrease the time cost of 
performance profiling experiments by mainly scaling the CPU 
frequencies of the tiers that dominate the server-side latency. 
For example, for RUBiS, the average service time percentage 
of the web server, the application server, and the database 
server are 0.11%, 17.63%, and 82.26%, respectively. So, we 
consider the database server as the dominated tier In this way, 
we can decrease the times of experiments to 10 x 4 x 1. 

D. Controller design 

In our feedback controller, the target system is the multi-tier 
server cluster. The control inputs are the new clock frequency 
vector The measured outputs are the server-side latencies of 
the top main N patterns, which are presented as a vector 
as well. We define THi (i=l,...,N) as the server-side latency 
threshold of pattern i. The reference inputs are the desired 
server-side latency threshold zones for the main patterns. LP 
and UP represent the lower and upper threshold factor. The 
controller module makes the measured outputs fall within the 
latency threshold zone by adjusting the values of the control 
inputs periodically. 

The controller module consists of two main procedures: fast 
modulation and step modulation. During the fast modulation 
procedure, by tracing the current load, PowerTracer can decide 
the current load level and compute out the optimal clock 
frequency of each server with reference to the given server- 
side latency threshold zones of the main patterns according to 



Equation (1). 

After the fast DVFS modulation, it is the step of modulation 
loop. In our system, step modulation periods are composed of 
alternate sampling periods and control periods. We model our 
system as Equation 2. 



D{t + l) = F{D{^.M) 



(2) 



D{t) is a vector that represents the average server-side la- 
tencies of each pattern in the tth sampling period. f{t) is a 
vector that represents the CPU frequency levels of the nodes 
in the tth period. F(t) is a vector that represents the transition 
functions of each pattern from states of the tth period to that 
of the (i + l)th period. 

We design the step modulation procedure as follows. In the 
tth sampling period, we use the request tracing module to 
obtain 5 -tuple online performance data of the top N patterns. 
By comparing the measured server-side latencies of the top 
N patterns with the latency threshold zones, the controller 
module modulates the DVFS setting based on the following 
procedure. 

For pattern i, if Di{t) exceeds its upper threshold UP*THi, 
the controller module chooses tier j that has the maximum 
service time to step up its CPU frequency and record the 
new frequency value into the frequency vector The frequency 
values of the other tiers remain intact. For any pattern i 
(i^l,..,N) D,{t) fall below their lower threshold LP*TH„ 
the controller module chooses tier j that has the minimum 
service time to step down its CPU frequency and record the 
new frequency into the frequency vector. The frequency values 
of the other tiers remain intact. 

The approach that we take the measured server-side laten- 
cies of the top A^ main patterns as the controlled variables, 
enables the majority of requests meet service-level agreement. 
A small percentage of requests is ignored so as to achieve the 
most power reduction. The step modulation procedure can be 
formally defined in Equation 3. 

7,(i + 1) = fj{t) + 1, 3i\{D,{t) > UP * TH,)A 

{STi,j{t) is_the_maximum); 

fj{t + l)=fj{t)-l, yi\{D,{t)<LP*TH^)A (3) 

{STi_j{t) is_the_minimum)] 

fj{t + 1) = fj{t), otherwise 

In Equation 3, STij{t) represents the service time of tier j 
in pattern i in the tth period. For fj, +1 indicates stepping up 
CPU frequency with one level, while —1 indicates stepping 
down CPU frequency with one level. 

IV. System Implementation 

As showin in Fig. 3, PowerTracer includes four major com- 
ponents: Tracer, Power Modeler, Controller and Scaler. 

On a basis of our previous work [9], we implement the 
request tracing system described in Section III-B and integrate 
it into PowerTracer as a module called Tracer As shown 
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Fig. 3. The architecture diagram of PowerTracer. 



in Fig. 3, Tracer reads all service logs and outputs 5-tuple 
performance data of the top A^ patterns. 

Through changing the DVFS modulation according to Sec- 
tion III-C, PowerModeler analyzes the 5-tuple performance 
data of the top N patterns, and then outputs the power model 
in the file Pre_model. 

For different load levels. Controller runs the fast modulation 
based on the Pre_model file. Then, Controller runs step 
modulation loops, which are composed of alternate sampling 
and control periods. In the sampling period. Tracer is called to 
output 5-tuple performance data for the top N patterns. In the 
control period. Controller makes decisions on changing clock 
frequencies as the approach presented in Section III-D. 

Scaler, which runs on each tier, is called by Controller to 
set clock frequencies. The clock frequency setting is invoked 
by setting frequency scaling governor of the Linux kernel and 
recording new frequency into scaling_setspeed file. 

V. Evaluation 

We use a 3-tier web appUcation RUBiS [10] to evaluate the 
efficiency of our approach in terms of system performance and 
power savings. RUBiS is a three-tier auction site prototype 
modeled after eBay.com, developed by Rice University. 

A. Experimental Setup 

The testbed is a heterogeneous 4-node platform composed 
of Linux-OS blade servers, which we name node A, node B, 
node C, and node D, respectively, as shown in Fig. 4. All 
four nodes are DVFS-capable. Nodes A and B both have 2 
capable processors, which support the frequencies at LO, 1,8, 
2.0 and 2.2GHz. Node C has 8 processors, which support 
the frequencies at 0.8, LI, L6 and 2.3GHz. Node D has 
2 AMD Opteron (tm) Processors. We use QINGZHI 877X 
power analyzer to measure the power consumption. As shown 
in Fig. 4, the platform is deployed with 3-tier web servers as 
follows: 

• Node A is deployed as the web server tier, which runs 
Apache ( server version 2.2.13). 

• Node B is deployed as the application server tier, which 
runs JBoss enterprise Java application (version 4.2). 

• Node C is deployed as the database server tier, which 
runs MySQL ( server version 5.0.45) database. 

• Node D runs RUBiS client emulator, which generates and 
sends requests to the Apache server. 




Fig. 4. The deployment diagram of 3-tier platform. 

B. Workload 

We use the RUBiS client emulator to generate workload by 
adjusting the parameters, e.g. the number of clients and transi- 
tion tables. In our experiments, we use two kinds of transition 
tables, which emulate read_only workload (browse_only table) 
and read_write mixed workload (transition table),respectively. 
For the read_only workload, we set the number of clients 
to 100, 200, 300, 400, and 500, respectively. For the mixed 
workload, in order to enhance experimental contrast effects, 
we set the number of clients to 500, 600, 700, 800, and 900, 
respectively. Each workload includes three stages, of which 
we set up ramp time, runtime session, and down ramp time 
as 10 seconds, 300 seconds, and 10 seconds, respectively. 

As shown in Fig. 4, Tracer and Scaler are deployed on all 
three servers. For Controller, we only deploy it on Node A 
rather than the other two servers. Note that the service delay 
at the web server tier (Node A) imposes the least impact on 
requests' performance. 

In the experiments, our prototype system is simplified in 
two aspects. First, we only modulate clock frequencies of 
the last two tiers (Jboss and Mysql servers) and leave the 
first tier running with its lowest clock frequency all the time. 
Second, we utilize a constant average value, which is gained 
by experiments, to represent the network latencies among the 
three tiers during the fast modulation procedure. The rationale 
behind the simplification is as follows: first, our experiments 
show that the service time of the first tier (Apache server), 
which is 10^^ to 10^^ milliseconds on average, contributes 
little to the server-side latency, which is 10 to 100 millisecond; 
second, current commercial network interface cards are not 
energy proportional, and we cannot change its setting as we 
modulate CPU frequencies to save power consumption. 

C. Experimental Results 

We use three metrics to evaluate our system: total system 
power savings compared to the baseline, request deadline 
miss ratio, and average server-side latency. We set the clock 
frequency of all servers to the maximum as the baseline. Note 
that the power consumption under the baseline is not fixed 
for different load levels, and higher load will lead to higher 
power consumption even with the same clock frequency. We 
assume the server-side latency under the baseline is SL. 
For PowerTracer, SL is a vector representing the server- 
side latencies of the chosen main patterns. For the request 
deadline miss ratio metric, we predefine a server-side latency 
deadline, which is under the SLA constraints. We compare the 
results of PowerTracer with those of the two other algorithms: 
the SimpleDVS algorithm presented by Horvath et al. [13] 



and the Ondemand governor offered by Linux kernel. The 
SimpleDVS algorithm takes CPU utilization as the indicator 
in determining which server's clock frequency should be 
scaled. The Ondemand governor, the most effective power 
management policy offered by Linux kernel, keeps the CPU 
frequency low when not needed, and instantly jumps back to 
full power when required [23]. In particular, we implement a 
modified version from PowerTracer, called PowerTracer_NP, 
which chooses the average server-side latency as the measured 
output, instead of N individual server-side latencies of the top 
N patterns. 

For all experiments, we set the sampling period as 1 second 
and the sampling interval as 5 seconds. The control period is 1 
second. We set the upper latency threshold factor, UP, and the 
lower latency threshold factor, LP, as 1.2 and 0.8, respectively. 
We trace the servers' performance statistics while setting the 
servers with different frequency values for different workload, 
and use the normal quadratic polynomial fitting to derive the 
performance model described in Section III-C. 

Note that the maximum power savings we can achieve in 
three nodes (node A, node B and Node C) is 17.68% when we 
set the clock frequencies of the three servers to the minimum 
without considering performance. We do not consider node D, 
because it is used to generate client requests. 
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Fig. 5. A comparison of the total system power savings. The x-axis represents 
the number of clients set by RUBiS client emulator. The y-axis represents the 
total system power savings compared to the baseline. 
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Fig. 6. A comparison of the average server-side latencies. The y-axis 
represents the average server-side latency of all requests. 

1) Read_only workload: Fig. 5, Fig. 6, and Fig. 7 present 
the total system power savings, the average server-side la- 
tencies, and the request deadline miss ratios of the four 
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Fig. 7. A comparison of the request deadline miss ratios. The y-axis 
represents the deadline miss ratio of all requests and the blank represents 
zero. The deadline is set to 0.5 second for RUBiS read_only workload. 
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Fig. 8. The total system power savings of PowerTracer under five latency 
thresholds. The x-axis represents five latency threshold levels. 



algorithms, respectively, when the number of clients varies 
from 100 to 500. In this set of experiments, we set the 
server-side latency threshold as 3 x SL. For PowerTracer, the 
number of main patterns is 5, which indicates that the top 5 
patterns are used as the guide for DVFS control. PowerTracer 
or PowerTracer_NP gains the highest power reduction. The 
Ondemand governor gains the lowest power reduction except 
when the number of clients is 500. PowerTracer outperforms 
PowerTracer_NP except when the number of clients is 300. 
When the number of clients is 500, PowerTracer gains the 
maximum power saving of 15.60% compared to the baseline, 
which is about 147% better than SimpleDVS^, and about 74% 
better than the Ondemand governor in terms of power saving. 
We also observe that Ondemand has poor performance in terms 
of both the average server-side latency and the request deadline 
miss ratio when the number of clients is 500. This is because 
the Ondemand governor of each server scales CPU frequencies 
without any coordination. 

From Fig. 8 to Fig. 10, we demonstrate how the performance 
varies when the reference input in PowerTracer — the server- 
side latency threshold, varies. We set the number of clients to 
500 and select five main patterns according to their fractions 
from the top seven patterns. For the top seven patterns, SL is 
(2.583734, 20.585, 25.88851, 63.86756, 74.38303, 63.06882, 
66.01713) milliseconds. We set five latency thresholds as 1.5, 

^Note that in [13] , when the SimpleDVS algorithm is performed on 
the three laptop computers with Mobile AMD Athlon XP DVS-capable 
processors, and the authors reported that the TPC-W service consumes as 
much as 30% less energy compared to the same baseline as in our paper. 



70X - 
60X - 
50» — 
40» — 
30% — 
20» - 
10% - 
00% — 




Server-side latency threshold 

Fig. 9. The request deadline miss ratios of PowerTracer under five different 
latency thresholds. The deadline is set to 0.5 second as above. 
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Fig. 10. The average server-side latencies of PowerTracer under five latency 
thresholds. 



2, 2.5, 3, and 3.5 x SL, respectively. From Fig. 8 to Fig. 10, 
we can see that when the latency threshold is 2.5 x SL, 
PowerTracer is about 43% better than that when the the 
latency threshold is 1.5 x SL in terms of the power saving. At 
the same timer, under these two configurations, their server- 
side latencies and their request deadline miss ratios are close 
to each other Therefore, choosing an appropriate latency 
threshold is critical to PowerTracer 

Fig. 11 ~ 13 show the performance of PowerTracer when 
a different number of main patterns is chosen out of the top 
seven major patterns. For these experiments, we set the number 
of clients to 500 and the latency threshold to 3 x SL. Fig. 11 
~ 13 show that if we set the number of main patterns to 
one, PowerTracer gains the highest power saving, but also the 
highest server-side latency and the highest request deadline 
miss ratio. Comparing the performance of one main pattern 
and that of five main patterns, we can see that choosing more 
patterns does not necessarily improve power savings and other 
metrics, and hence, we need to choose the number of main 
patterns based on different workloads so as to achieve the 
optimal power saving and performance. 

2) Read_write mixed workload: Fig. 14, Fig. 15 and Fig. 16 
present the total system power savings, the request deadline 
miss ratios, and the average server-side latencies gained by 
the four algorithms, respectively, when the number of clients 
varies from 500 to 900. In these experiments, we set the 
number of the main patterns to 3 and the latency threshold 
to 5 x SL. We can see that PowerTracer or PowerTracer_NP 
gains the highest power saving. PowerTracer outperforms 
PowerTracer_NP except when the number of clients is 500. 
The Ondemand governor has the lowest power reduction. 
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Fig. 11. A comparison of the total system power savings with different 
number of main patterns. The x-axis represents the chosen number of the 
main patterns. 
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Fig. 12. A comparison of the request deadline miss ratios with different 
number of main patterns. 



When the number of clients is 600, PowerTracer achieves the 
maximum power saving of 10.88% compared to the baseline, 
which is about 74% better than SimpleDVS and about 515% 
better than the Ondemand governor in terms of power saving. 

Fig. 17 ~ Fig. 19 illustrate how the performance varies when 
the reference inputs in PowerTracer — the latency threshold 
varies. For this set of experiments, we set the number of clients 
to 500 and select three main patterns out of the top eight 
patterns. For the top eight patterns, 5X is (0.15625, 9.207284, 
13.67607, 11.8227, 28.14555, 22.5599, 22.7111, 22.50751) 
milliseconds. Fig. 17 ~ Fig. 19 indicate that when the latency 
threshold is 5 x SL, PowerTracer is only about 8% better 
than that when the latency threshold is 3 x SL in terms of 
the power saving. Meanwhile, under these two configurations, 
their server-side latencies and their deadline miss ratios are 
close to each other. 

Fig. 20 ~ Fig. 22 present the performance of PowerTracer 
when a different number of main patterns is chosen out of the 
top eight patterns. For these experiments, we set the number of 
clients to 500 and the latency threshold to 5 x SL. Our results 
show that for the Read_write mixed workload, PowerTracer 
achieves the optimal results, i.e., the higher power saving, the 
lower miss ratio, and the lower server-side latency when we 
set the number of main patterns to three. 

D. Discussion 

For two different workloads of RUBiS, the experiment 
results show that PowerTracer or PowerTracer_NP outper- 
forms its peer, indicating that request tracing can improve 
the accuracy of DVFS control. At most of time, PowerTracer 
outperforms PowerTracer_NP. This implies that monitoring the 
performance data of main patterns, instead of an average one. 
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Fig. 13. A comparison of the average server-side latencies with different 
number of main patterns. 
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Fig. 14. A comparison of the total system power savings. 



can also improve the accuracy of DVFS control. However, 
the optimal number of main patterns depends on different 
workloads. We also observe that setting a higher latency 
threshold under a certain limit in PowerTracer improves power 
savings. 

Generality. Our system is dependent on server-side laten- 
cies and service time per tier of the top A^ patterns of a multi- 
tier service, which can be obtained with either black-box or 
white-box request tracing approaches. While PowerTracer is 
built on a basis of the black-box request tracing approach, our 
system can be built upon other white-box or black-box ones, 
like [18] [19] [20] [21] [22] [46]. 

Scalability. In our previous work PreciseTracer [9], we have 
demonstrated how to improve system scalability through two 
mechanisms: tracing on demand and sampling. Besides, our 
experiments in [9] show that PreciseTracer has fast respon- 
siveness, and imposes negligible effects on the throughput 
and the average response time of services, like RUBiS. On a 
basis of those features, PowerTracer is promising in its system 
scalability. 

Potential for power saving. So far, in most of commercial 
servers, CPU is the only energy proportional component, and 
our work is also confined by this limitation. Barroso et al. 
[17] showed that four components, including CPU, DRAM, 
disk, and network switches, are the main sources of power 
consumptions in data center. Therefore, we believe that our 
accurate DVFS control will play a more important role in 
saving power consumption when the concept of DVFS is 
extended to the other system components. 
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Fig. 15. A comparison of tlie request deadline miss ratios. The deadline 
target is set to 0.2 second for read_write mixed workload. 
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Fig. 16. A comparison of the average server-side latencies. 



Fig. 17. The total system power savings of PowerTracer under two latency 
thresholds. 
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Fig. 18. The request deadline miss ratios of PowerTracer under two latency 
threshold. The deadline is set to 0.2 second. 



VI. Related Work 

The related work can be classified into four categories: 
DVFS in server clusters, dynamic cluster reconfiguration, 
virtual machine based server consolidation and power provi- 
sioning. 

DVFS in server clusters. The closest work to this paper is 
[13] by Horvath et al. They proposed a coordinated distributed 
DVS policy based on a feedback controller for three-tier web 
server systems. However, their work fails to provide accurate 
DVFS control for two reasons: first, the simple DVS algorithm 
uses CPU utilization as the indicator in determining which 
server's clock frequency should be scaled, while the optimized 
algorithm is difficult to be applied because of its complexity; 
second, the two algorithms take the average server-side latency 
of all requests as the controlled variable, while our experi- 
ments show that massive requests have a number of different 
patterns. In [5], Horvath et al. proposed a multi-mode energy 
management for multi-tier server clusters, which exploited 
DVS together with multiple sleep states. In [6], Horvath et 
al. invented a service prioritization scheme for multi-tier web 
server clusters, which assigned different priorities based on 
their performance requirements. In [16], Chen et al. developed 
a simple metric called frequency gradient that can predict the 
impact of changes in processor frequency upon the end-to-end 
transaction response times of multi-tier applications. 

Dynamic cluster reconfiguration. K. Rajamani et al. [31] 
improved energy efficiency by powering down some servers 
when the desired quality of service can be met with fewer 
servers. M. Elnozahy et al. [25] used request hatching to con- 
serve energy during periods of low workload intensity. Facing 
challenges in the context of connection servers, G. Chen et al. 



[7] designed a server provisioning algorithm to dynamically 
turn on a minimum number of servers, and a load dispatching 
algorithm to distribute load among the running machines. In 
integrating independently energy saving policies, J. Heo et al. 
[26] presented a mechanism, called adaptation graph analysis, 
for identifying potential incompatibilities between composed 
adaptation policies. 

Virtual machine based server consolidation. G. Dhiman 
et al [36] indicated that co-scheduling VMs with heteroge- 
neous characteristics on the same physical node is beneficial 
from both energy efficiency and performance point of view. 
Y. Wang et al. [24] proposed Virtual Batching, a novel request 
batching solution for virtualized servers with primarily light 
workloads. Y. Wang et al. [28] proposed a two-layer control 
architecture based on well-established control theory. X. Wang 
et al. [38] proposed Co-Con, a cluster-level control architecture 
that coordinates individual power and performance control 
loops for virtualized server clusters. P. Padala et al. [42] de- 
veloped an adaptive resource control system that dynamically 
adjusts the resource shares to individual tiers in order to meet 
application-level QoS goals . In their later work [43], P. Padala 
et al. present AutoControl, a resource control system that 
automatically adapts to dynamic workload changes to achieve 
application SLOs. 

Power provisioning. Related work focuses on the perfor- 
mance optimization problem under power constraints, while 
our work addresses the power optimization problem under 
performance constraints [5]. P. Ranganathan et al. [4] proposes 
power efficiencies at a larger scale by leveraging statistical 
properties of concurrent resource usage across a collection of 
systems (ensemble). C. Lefurgy et al. [1] present a technique 
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Fig. 20. A comparison of the total system power savings with different 
number of main patterns. 
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Fig. 22. A comparison of the average server-side latencies with different 
number of main patterns. 



that controls the peak power consumption of a high-density 
server . X. Wang et al. [2] [27] propose a cluster-level 
power controller that shifts power among servers based on 
their performance needs, while controlling the total power 
of the cluster to be lower than a constraint. S. Pelley et 
al.[29] develop mechanisms to better utilize installed power 
infrastructure. S. Govindan et al. [35] explore a combination 
of statistical multiplexing techniques to improve the utilization 
of the power hierarchy within a data center. R. Raghavendra 
et al. [34] propose and validate a power management solution 
that coordinates different individual energy-saving approaches. 
X. Fan et al. [33] present the aggregate power usage charac- 
teristics of large collections of servers (up to 15 thousand) for 
different classes of applications over a period of approximately 
six months, and modeling to attack data center-level power 
provisioning inefficiencies. 

VII. Conclusion 

In this paper, we have proposed a novel request tracing 
approach for cluster-level DVFS control. A request tracing 
tool can characterize major causal path patterns in serving 
different requests and capture server-side latency, especially 
service time of each tier in different patterns. The advantage 
of the request tracing approach is two-fold: first, it decreases 
the time cost of performance profiling experiments; second, 
it decreases the controller complexity so that we introduce a 
simpler feedback controller, which only relies on the single- 
node DVFS modulation at a time. Based on the request 
tracing approach, we have presented a hybrid DVFS control 
algorithm that combines an empirical performance model 
for fast modulation at different load levels and a simpler 
controller for adaption. We have developed a prototype of 



the proposed system, called PowerTracer, and conducted real 
experiments on a 3-tier platform to evaluate its performance. 
Our experimental results show that PowerTracer outperforms 
its peer [13] in terms of power saving. 
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